Abstract

Although microorganisms play crucial roles in ecosystems, metagenomic analyses 
of soil samples are quite scarce, especially in the Southern Hemisphere. In this 
work, the microbial diversity of soil samples from an Atlantic Forest and Caatinga 
was analyzed using a metagenomic approach. Proteobacteria and Actinobacteria 
were the dominant phyla in both samples. Among them, a significant proportion
of stress-resistant bacteria associated with organic matter degradation was found.
Sequences related to metabolism of amino acids, nitrogen, and DNA and stress 
resistance were more frequent in Caatinga soil, while the forest sample showed the 
highest occurrence of hits annotated in phosphorus metabolism, defense mechanisms,
and aromatic compound degradation subsystems. The principal component
analysis (PCA) showed that our samples are close to the desert metagenomes in 
relation to taxonomy, but are more similar to rhizosphere microbiota in relation to 
the functional profiles. The data indicate that soil characteristics affect the taxo- 
nomic and functional distribution; these characteristics include low nutrient con- 
tent, high drainage (both are sandy soils), vegetation, and exposure to stress. In 
both samples, a rapid turnover of organic matter with low greenhouse gas emission 
was suggested by the functional profiles obtained, reinforcing the importance of 
preserving natural areas. 
Introduction

Despite soils being the largest reserve of microbial biodiversity, our
knowledge about their genetic pool is still limited (Whitman et al. 1998;
Torsvik et al. 2002; Curtis and Sloan 2004; Mocali and Benedetti 2010).
In general, soil microbial diversity has been mainly assessed through
methods based on 16S rRNA gene analysis, which allow richness to be
estimated but fail to identify functional attributes. Thus, metagenomic
studies have contributed toward the understanding of microbial diversity
by allowing the identification of both taxa and functional profiles, as
the abundance of particular genes has been used as an indicator of
biogeochemical processes (Morales et al. 2010; Brankatschk et al. 2013).

Soil characteristics are important factors that affect the 
microbial diversity and the dynamics of biogeochemical 
cycles. As an example, soil pH and nutrient availability have 
been considered the main factors driving the composition of
soil bacterial communities (Lauber et al. 2009; Goldfarb
et al. 2011; Griffiths et al. 2011; Kuramae et al. 2012). In
addition, high temperature, moisture, low pH, and carbon
and nitrogen availability contribute to the increase in N2O
emission from soil, whereas aeration and drought reduce it
(Sangeetha et al. 2009; Cleveland et al. 2010; Saggar et al. 2013).
Even so, our knowledge about biogeochemical processes is still
scarce, mainly in relation to tropical soils, and it remains unclear
how soil characteristics together determine microbial diversity
and biogeochemical processes.

With climate change and the increase in greenhouse gas emissions,
understanding microbial diversity and its involvement in these
processes will be important for adequate soil management. The highest
emissions of CH4 and N2O are from tropical soils, but these soils also
show high consumption of CO2 (Montzka et al. 2011). However, the
majority of these data are from humid tropical regions, and
little is known about semiarid tropical soils.

In this work, the microbial diversity of soil samples from Parque das
Dunas (PD), as a representative of the Atlantic Forest biome, and Joao
Camara city, as a representative of the Caatinga biome, both located in
the State of Rio Grande do Norte (RN, Brazil), was investigated using a
metagenomic approach. The Atlantic Forest is the third largest Brazilian
biome, comprising particular ecosystems such as mangroves and forests
that span almost the entire Brazilian coast, and is the second largest
humid tropical forest in South America. However, less than 10% of the
native forest is preserved (Myers et al. 2000), and few works have
investigated its microbial diversity. Analysis of the 16S rRNA gene in
soil samples from the Atlantic Forest of Paraná and Rio de Janeiro
showed a dominance of Acidobacteria and Proteobacteria (Bruce et al.
2010; Faoro et al. 2010). Concerning gas emission, the analysis of soil
samples of Atlantic Forest from Sao Paulo showed that N2O emission and
CH4 uptake are within the range of other tropical forests of the world.
The authors also observed that N2O and CO2 emissions were lower at
higher altitudes, which may be associated with the temperature decrease
(Sousa Neto et al. 2011). The PD in Natal city (northeastern Brazil)
differs from other representatives of the Atlantic Forest in its soil
characteristics. The growth of forest on the dunes suggests the
occurrence of a specific microbial diversity, differing from the
microbiota observed in Atlantic Forest soils from Paraná and Rio de
Janeiro, which are more clayey.

The Joao Camara (JC) sampling site is located in the semiarid Caatinga,
the only exclusively Brazilian biome. The Caatinga occupies 18% of the
Brazilian territory and is the most populous semiarid region of the
world. It has a rich diversity of plants, but only 47% of the native
vegetation is preserved. Due to natural and anthropogenic factors, the
Caatinga is considered a fragile ecosystem subject to desertification.
Despite its biological importance, the microbial diversity of Caatinga
soils is still unknown; given the severe climate conditions (high
temperature, high UV exposure, and long periods of drought), a low and
specialized microbial diversity has been estimated (Giongo et al. 2011;
Menezes et al. 2012).

Because the Atlantic Forest and Caatinga in northeastern Brazil have
specific soil characteristics (both are sandy soils poor in nutrients)
and an endemic flora, we investigated the hypothesis that their soil
microbes show a distinct taxonomic and functional composition in
relation to other biomes.

Reports of microbial diversity are scarce in the Southern Hemisphere,
which remains a largely unexplored field of study for metagenomics. To
our knowledge, this is the first metagenomic study conducted in soils of
the Caatinga and Atlantic Forest biomes that describes the taxonomic and
metabolic profiles of the microbial community and provides a comparative
analysis between these two environments and public metagenomes from
other biomes.

Materials and Methods 
Sample collection 

Sampling was performed in early October 2009 in two different regions of
Rio Grande do Norte state, Brazil: the Parque das Dunas (Park of Dunes,
PD) in Natal city, an environmental conservation area of the Atlantic
Forest biome (1172 ha) covered by native coastal dune vegetation, and
Joao Camara city (JC), a semiarid area of the Caatinga biome 73 km from
Natal (Table 1 and Figure S1). The collection site at PD (S5° 50.530'
W35° 11.598') is located on uneven ground; the soil is grayish, under
indirect sunlight, and has many roots. The collection site in JC (S5°
30' 51.81" W35° 54' 17.13") is located on flat ground; the soil is dry,
dark brown, under direct sunlight, and has few roots (Table 1). The
sampling followed the recommendations of Schneegurt et al. (2003). In
brief, after the removal of roots, ~500 g of soil was collected (depth
5-10 cm) using a sterile spatula and immediately transferred to sterile
50 mL tubes kept on ice. Organic and physical-chemical analyses of the
soil samples were performed by the Laboratory of Soil, Water and Plants
Analyses of EMPARN (Natal, RN, Brazil).

DNA preparation, extraction, purification, 
and sequencing 

Initially, samples were sieved through 2 mm sterile sieves in order to
eliminate undesired constituents such as roots and tiny stones.
Afterwards, 10 g of each soil sample was subjected to direct extraction
and purification of DNA using the PowerMax™ Soil DNA Isolation Kit
(MoBio Laboratories, Inc., Carlsbad, CA) following the manufacturer's
instructions.

Table 1. General features of the studied areas.

Feature | Parque das Dunas (PD) | Joao Camara (JC)
Biome | Atlantic forest | Caatinga
Climate zone | Humid tropical | Semiarid
Temperature (annual average) | 22.6-29.2°C | 21-32°C
UV | High, but indirect due to canopy | High and direct
Rainfall (annual average) | 1600 mm | 648 mm
Vegetation | Predominantly arboreal, also presenting shrubs and herbaceous plants. The main families found are Leguminosae, Myrtaceae, Gramineae (Poaceae), Compositae, Euphorbiaceae, Convolvulaceae, and Rubiaceae (Freire 1990) | Xerophilous species, shrubs, thorny and deciduous small trees (Santos et al. 2010)
Soil | Low compaction, sandy, grayish | Compact, dry, dark brown, with few roots

For sequencing, the libraries were prepared following the instructions
of the GS FLX Titanium General Library Preparation Method Manual
(454-Roche), using 5 µg of DNA from each sample. The titration, emulsion
PCR, and sequencing steps were performed according to the manufacturer's
instructions. A four-region 454 sequencing run was performed on a
70 × 75 PicoTiterPlate (PTP) using the Genome Sequencer FLX System
(Roche Applied Science, Sao Paulo, Brazil) (Margulies et al. 2005). Each
library was loaded onto one quarter of the plate. The sequences were
deposited in GenBank under accession numbers SRA026684 and SRA026685
and on the MG-RAST server (4459906.3 and 4459907.3).

The replicated sequences generated as an artifact of 454-based
pyrosequencing were eliminated using the Replicates software
(Gomez-Alvarez et al. 2009), and sequences <120 bp were removed by the
LUCY program (Chou and Holmes 2001). As a result, 27,618 and 22,611
reads were removed in PD and JC, respectively.
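
The read-level filtering described above can be illustrated with a minimal sketch. It is not the procedure implemented by the Replicates or LUCY programs: it removes only exact duplicates (Replicates also clusters near-identical reads) and applies only the 120 bp length cutoff, without any quality trimming. The function name `filter_reads` and the toy reads are hypothetical.

```python
def filter_reads(reads, min_length=120):
    """Drop exact duplicate reads, then reads shorter than min_length bp."""
    seen = set()
    kept = []
    for read in reads:
        seq = read.upper()
        if seq in seen:
            # Exact copy of an earlier read: treat as an artificial replicate.
            continue
        seen.add(seq)
        if len(seq) >= min_length:
            # The text removes sequences <120 bp, so 120 bp reads are kept.
            kept.append(seq)
    return kept


# Toy input (hypothetical reads, not data from the study).
reads = ["ACGT" * 40, "ACGT" * 40, "ACGT" * 10, "TTGA" * 35]
filtered = filter_reads(reads)
print(len(filtered), "reads kept out of", len(reads))
```

Deduplicating before the length check keeps the two steps independent, so the number of reads removed by each step can be reported separately if needed.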

The assembly was conducted using the Newbler Assem- 
bler 2.5.3. Reads identified as Partial, Repeat, Outlier, Too- 
Short, and with high-quality discrepancies were filtered 
from the dataset. Assembling and filtering cycles were per- 
formed until the discrepancies were limited to 1% of the 
total of the reads filtered out at the first assembly step. How- 
ever, at the cutoff threshold no contigs were assembled. 

Taxonomic distribution and statistical 
analyses 

The taxonomic profiles of the metagenomic reads were assigned using the
MG-RAST server. In MG-RAST, the species richness was computed as the
antilog of the Shannon diversity (Meyer et al. 2008). The abundance data
were obtained through the lowest common ancestor (LCA) approach, with
cutoffs of 1e-05 as the maximum e-value, a minimum identity of 60%, and
a minimum alignment length of 15. The statistical analysis of distinct
taxonomic levels from MG-RAST was conducted using the Statistical
Analyses of Metagenomic Profiles (STAMP) software (Parks and Beiko
2010). The significance of the difference in relative proportions
between the taxonomic distributions of the PD and JC samples was
assessed using the two-sided Fisher's exact test, with the
Newcombe-Wilson confidence interval method. Because P-values were not
uniformly distributed using Storey's false discovery rate (FDR),
Benjamini-Hochberg FDR was applied for correction. Results with
q < 0.05 were considered significant, and the unclassified reads were
removed from the analyses. The biological relevance of the statistically
significant taxa was determined by requiring a difference between the
proportions of at least 1% and a twofold ratio between the proportions.
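
The correction and filtering steps above can be sketched as follows. This is a minimal illustration, not STAMP's implementation: it takes P-values as given (Fisher's exact test itself is not reproduced), applies the standard Benjamini-Hochberg adjustment, and then applies the two effect-size filters. The function names, the input P-values, and the q-value passed to `is_relevant` are hypothetical; the percentages are the phylum proportions reported in the Results.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values (q-values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def is_relevant(pct_a, pct_b, q, alpha=0.05, min_diff=1.0, min_ratio=2.0):
    """Require q < alpha, a difference of >= min_diff percentage points,
    and a ratio of >= min_ratio between the two proportions."""
    high, low = max(pct_a, pct_b), min(pct_a, pct_b)
    ratio = high / low if low > 0 else float("inf")
    return q < alpha and (high - low) >= min_diff and ratio >= min_ratio


# Hypothetical P-values for four taxa.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))

# Acidobacteria (9.1% vs 2.4%) passes both effect-size filters;
# Proteobacteria (26.1% vs 24.8%) fails the twofold ratio.
print(is_relevant(9.1, 2.4, q=0.001))
print(is_relevant(26.1, 24.8, q=0.001))
```

Separating the statistical test from the effect-size filter mirrors the distinction the text draws between statistical significance and biological relevance.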

A taxonomic analysis was also conducted using the MEGAN4 software
(Huson et al. 2011). The reads were compared against the NCBI NR and NT
databases using the BLASTX and BLASTN algorithms (Altschul et al. 1997),
respectively. Statistical tests on the taxonomic data were also
performed with MEGAN. The PD and JC counts were normalized to produce
data sets of 100,000 reads. The analysis was performed comparing
distinct hierarchical levels, and the directed homogeneity test was
applied to highlight the significant differences in the sample
comparisons. The highlighting thickness is logarithmically proportional
to its significance; that is, the thickness is an integer value of
2 log x when P = 1.0e-x (Mitra et al. 2009). Multiple testing correction
was not applied, and all unassigned reads were ignored.
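
The normalization to 100,000 reads can be pictured as a simple proportional rescaling of the counts. The sketch below assumes exactly that; MEGAN's own normalization may differ in its rounding and in which reads it includes. The function name `normalize_counts` is hypothetical, and the input uses the MEGAN domain-level counts for PD from Table 3 purely as sample values.

```python
def normalize_counts(counts, target=100_000):
    """Rescale counts proportionally so that they sum to about `target`."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize an empty profile")
    return {taxon: round(n * target / total) for taxon, n in counts.items()}


# MEGAN domain counts for PD (Table 3), used here only as sample input.
pd_counts = {"Archaea": 266, "Bacteria": 88786, "Eukaryota": 7489, "Viruses": 11}
pd_normalized = normalize_counts(pd_counts)
print(pd_normalized, sum(pd_normalized.values()))
```

Because each value is rounded independently, the normalized total can differ from the target by a few reads.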

Functional analyses 

Functional profiles were identified using the SEED subsystems annotation
source of MG-RAST, with 1e-05 as the maximum e-value, a minimum identity
of 60%, and a minimum alignment length of 15. Distinct functional levels
from MG-RAST were statistically analyzed in STAMP, using the same
parameters described above for the taxonomic distribution. Through the
workbench tool of the MG-RAST server, we generated subsets of the reads
annotated in a functional subsystem for taxonomic identification. The
metabolic pathways of biogeochemical cycles were generated using the
Kyoto Encyclopedia of Genes and Genomes (KEGG) from the MG-RAST server,
with a maximum e-value cutoff of 1e-05, a minimum identity of 60%, and a
minimum alignment length of 15.
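
The three annotation cutoffs used throughout the Methods amount to a simple conjunction of thresholds. The sketch below shows that logic only; it does not call MG-RAST, and the `Hit` record, its field names, and the sample hits are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    read_id: str
    e_value: float
    identity: float    # percent identity
    align_length: int  # alignment length, in the units reported by the server


def passes_cutoffs(hit, max_e=1e-5, min_identity=60.0, min_length=15):
    """Apply the e-value, identity, and alignment-length cutoffs together."""
    return (
        hit.e_value <= max_e
        and hit.identity >= min_identity
        and hit.align_length >= min_length
    )


# Hypothetical hits, for illustration only.
hits = [
    Hit("read_001", 1e-20, 85.0, 120),  # passes all three cutoffs
    Hit("read_002", 1e-3, 90.0, 100),   # e-value too high
    Hit("read_003", 1e-10, 55.0, 80),   # identity too low
    Hit("read_004", 1e-8, 75.0, 12),    # alignment too short
]
print([h.read_id for h in hits if passes_cutoffs(h)])
```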

A functional analysis using the SEED (Overbeek et al. 2005) and KEGG
(Kanehisa et al. 2012) databases was also conducted using the MEGAN4
software. Each sequence was assigned a SEED functional role based on its
best BLAST score against protein sequences with known functional roles.
A similar procedure was used to match each sequence to a KEGG orthology
(KO) accession number.

A database was constructed from the RefSeq protein sequences of a number
of key enzymes of different and important metabolic pathways. A
screening for these key functional enzymes was conducted by BLAST, using
the RefSeq sequences against the PD and JC metagenomes. Matches with
alignment scores higher than 80 were retained.
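
The score-based retention step can be sketched as a filter over BLAST tabular output. The sketch assumes the standard 12-column tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and assumes that the "alignment score" in the text corresponds to the bit score; neither assumption is confirmed by the text. The function name `filter_hits` and the sample rows are hypothetical.

```python
def filter_hits(rows, min_score=80.0):
    """Keep (query, subject, bit score) for rows whose bit score exceeds min_score."""
    kept = []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip comment or malformed lines
        bitscore = float(fields[11])
        if bitscore > min_score:
            kept.append((fields[0], fields[1], bitscore))
    return kept


# Hypothetical tabular rows: enzyme reference sequence as query, read as subject.
rows = [
    "\t".join(["enzymeA", "read_001", "72.5", "90", "20", "1",
               "1", "90", "3", "272", "2e-30", "121.0"]),
    "\t".join(["enzymeA", "read_002", "61.0", "40", "15", "0",
               "5", "44", "10", "129", "1e-06", "48.5"]),
]
print(filter_hits(rows))
```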

Comparative metagenomic analysis 

The taxonomic and SEED subsystem profiles of metagenomic samples from
soil, water, and host-associated samples were obtained from the MG-RAST
server. Samples belonging to rhizosphere, temperate and tropical forest,
marine habitat, host-associated, and desert biomes were included. The
criteria applied for inclusion were a maximum e-value cutoff of 1e-05, a
minimum identity of 60%, and a minimum alignment length of 15. The
metagenomes included in the analysis were 4440463.3, 4440939.3,
4444130.3, 4444164.3, 4444165.3 (animal-associated habitat), 4465556.3,
4449956.3 (rhizosphere biome), 4477805.3, 4477872.3, 4477901.3,
4477904.3 (desert biome), 4443713.3, 4441057.4, 4441586.3, 4441578.3
(marine habitat), 4477876.3, 4477877.3, 4477899.3 (temperate forest),
4477807.3 and 4477875.3 (tropical forest). Trends in the abundance of
the taxonomy and the SEED subsystems were examined using Principal
Component Analysis (PCA) through the multiple groups analysis of STAMP,
in which the statistical test applied was analysis of variance (ANOVA)
with the Games-Howell post hoc test and Benjamini-Hochberg FDR for
correction. For the comparison between two groups, Welch's t-test,
Welch's inverted method for confidence intervals, and Benjamini-Hochberg
FDR for correction were applied. The difference in relative proportions
in the functional distribution of the PD and JC samples was considered
significant when q < 0.05. The unclassified reads were removed from the
analyses.
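
For the two-group comparisons, the Welch's t statistic and its Welch-Satterthwaite degrees of freedom can be computed as sketched below. This illustrates the test statistic only, not STAMP's implementation: the P-value (which needs the t distribution), the confidence interval, and the FDR correction are omitted. The function name `welch_t` and the sample proportions are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard error contributed by each group (unequal variances allowed).
    se_a = variance(group_a) / len(group_a)
    se_b = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical relative proportions (%) of one class in two groups of metagenomes.
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 4.0, 6.0]
t_stat, dof = welch_t(group_a, group_b)
print(round(t_stat, 4), round(dof, 4))
```

Welch's version of the t-test does not assume equal variances, which suits groups with different numbers of metagenomes.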

Results 

Organic and physical-chemical 
characteristics of soil samples 

The organic and physical-chemical parameters of the PD and JC soil
samples are summarized in Table 2. Both samples are classified as very
acidic soils (pH < 5.0) and exhibit a large portion of sand,
particularly in PD. However, the PD soil shows lower values for all the
components analyzed and a higher C:N ratio when compared to the JC
sample, an indicator of a deficiency in nutrients and minerals.

Table 2. Organic and physical-chemical parameters of PD and JC soil samples.

Sample | pH in water | Ca (cmol.dm-3) | Mg (cmol.dm-3) | Al (cmol.dm-3) | H + Al (cmol.dm-3) | P (mg.dm-3) | K (mg.dm-3) | Na (mg.dm-3) | N (g.dm-3) | Organic matter (g.dm-3) | C:N
PD | 4.99 | 0.30 | 0.14 | 0.10 | 1.32 | 2 | 12 | 4 | 0.31 | 11.26 | 21:1
JC | 4.60 | 9.5 | 23.5 | 2.25 | 7.76 | 4 | 305 | 199 | 1.10 | 24.6 | 12.7:1

Granulometry (%)

Sample | Sand | Clay | Silt | Texture classification
PD | 97.9 | 2 | 0.1 | Sandy
JC | 58.2 | 12 | 29.8 | Sandy loam

Table 3. Taxonomic profile of PD and JC samples at the domain level,
computed by MEGAN and MG-RAST.

Domain | MEGAN PD | MEGAN JC | MG-RAST PD | MG-RAST JC
Archaea | 266 | 662 | 232 | 366
Bacteria | 88,786 | 106,396 | 74,299* | 93,371*
Eukaryota | 7489 | 794 | 5312* | 701*
Viruses | 11 | 25 | 5* | 21*

*Differences statistically significant between PD and JC samples (P < 1e-15) by STAMP.

General characteristics of the metagenomes 

The soil DNA sequencing resulted in 147,278 and 151,274 reads from PD
and JC, with a total of 64,214,298 and 68,328,253 bp, an average length
of 436 ± 63 and 451 ± 60 bp, and a GC content of 61 ± 8 and 66 ± 7%,
respectively (Table S1). Reads identified as artificial duplicates were
removed by MG-RAST. After quality control, 155,805 proteins were
predicted for the PD sample and ~53% of the total reads were annotated
as functionally assigned proteins. For the JC metagenome, 164,449
proteins were predicted and 61.6% of the total reads were functionally
identified (Table S1). The algorithm implemented by MEGAN assigned more
sequences than MG-RAST and identified a higher number of sequences
related to Bacteria, Archaea, Eukarya, and Viruses (Tables S1 and 3),
probably due to the higher number of available sequences in the
reference database. The species richness estimated through the
α-diversity index was 448.587 for the JC metagenome and 441.864 for PD.
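
As stated in the Methods, the richness value is the antilog of the Shannon diversity. A minimal sketch of that calculation follows, assuming the natural logarithm and per-species abundance counts as input; the function names and toy counts are hypothetical and do not reproduce the values reported above.

```python
from math import exp, log


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over non-zero abundances."""
    total = sum(counts)
    return -sum((n / total) * log(n / total) for n in counts if n > 0)


def effective_richness(counts):
    """Antilog of the Shannon index: the effective number of species."""
    return exp(shannon_index(counts))


# Toy abundances (hypothetical): an even and an uneven community of four species.
print(effective_richness([10, 10, 10, 10]))
print(effective_richness([37, 1, 1, 1]))
```

For a perfectly even community the antilog equals the number of species, and it drops toward 1 as a single species dominates, which is why the value is read as an effective richness.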

Comparative taxonomic profiles 

At the phylum level, the microbiota profile generated by MG-RAST was
similar in the PD and JC samples. The most abundant phyla in both
samples were Actinobacteria (27.8% and 36.4%), Proteobacteria (26.1%
and 24.8%), and Acidobacteria (9.1% and 2.4%). However, statistically
significant differences were observed for Proteobacteria, Acidobacteria,
and Chlamydiae, which were more frequent in the PD sample, and for
Actinobacteria, Bacteroidetes, and Cyanobacteria, which were more
frequent in JC. Concerning the Archaea and Eukarya phyla, the highest
occurrence of Thaumarchaeota was observed in the JC sample, while
Ascomycota was predominant in the PD sample. The MEGAN analysis showed
similar results for phylum representation (data not shown).

Actinobacteria (27.8% and 36.4%) and Alphaproteobacteria (14.86% and
12.41%) were the predominant classes in PD and JC. The classes
Alphaproteobacteria, Solibacteres, and Acidobacteria were more frequent
in the PD sample (Fig. 1), with the highest representation of the orders
Rhizobiales, Solibacterales, and Acidobacteriales according to the STAMP
analysis. In the JC sample, Actinobacteria, Deltaproteobacteria, and
unclassified Cyanobacteria were predominant (Fig. 1), with
overrepresentation of the orders Actinomycetales, Sphingomonadales, and
Myxococcales. The statistical analyses implemented by MEGAN showed that
the PD metagenome has a significantly higher number of reads related to
the Acidobacteria, Alphaproteobacteria, and Planctomycetia classes,
whereas Betaproteobacteria, Gammaproteobacteria, and Deltaproteobacteria
are statistically more represented in the JC metagenome (Figure S2).

Both metagenomes showed a similar microbial composition considering some
of the most frequently found genera, despite the difference in their
representation (Fig. 2). The PD sample showed a predominance of
Candidatus Solibacter, Candidatus Koribacter, Acidobacterium
(Acidobacteria), Ktedonobacter (Chloroflexi), and Catenulispora
(Actinobacteria). In the JC sample, the genera with significant
representation were Conexibacter, Nocardioides, Rubrobacter,
Geodermatophilus (Actinobacteria), and Gemmatimonas (Gemmatimonadetes)
(Fig. 2). For PD, Bradyrhizobium, Burkholderia, Microvirga
(Proteobacteria), and Gemmata (Planctomycetes) were significant only in
the MEGAN analysis. Additionally, Rhodococcus, Actinoplanes
(Actinobacteria), and Candidatus Chloracidobacterium (Acidobacteria)
were indicated as statistically meaningful for the JC metagenome by
MEGAN (Fig. 2).

Figure 1. Comparative taxonomic profile of the PD and JC samples at class level, computed by MG-RAST. Classes with significant biological
differences (P < 0.05, difference between the proportions >1% and a twofold ratio between the proportions, STAMP) for the Bacteria domain
(A), the Archaea domain (B), and the Eukaryota domain (C).

Figure 2. Comparative taxonomic profile of the PD and JC samples at genus level, computed by MG-RAST (A) and MEGAN (B). The twenty most
abundant genera found in each sample are shown. *Significant differences between PD and JC samples (P < 0.05, difference between the
proportions >1% and a twofold ratio between the proportions).

Among Archaea classes, Halobacteria (Euryarchaeota 
phylum) and Thermoprotei (Crenarchaeota phylum) 
showed significant results in JC and PD metagenomes, 
respectively, although with low occurrence (Fig. 1). 

The PD and JC samples presented a wide variation in hits for the
Eukaryota domain and a low occurrence of viruses. Although the most
abundant phyla were similarly represented in the two metagenomes, the
number of sequences assigned to Eukaryota was around 10-fold higher in
PD. The major contribution to the Eukaryota microbiota in PD came from
the Ascomycota phylum, with a predominance of Eurotiomycetes, followed
by the Sordariomycetes and Dothideomycetes classes (Fig. 1). Among
Eurotiomycetes, Aspergillus was the most frequent genus, with 0.171% of
all hits for PD, according to the MG-RAST analyses. The most abundant
Ascomycota in the JC sample was Nectria haematococca mpVI 77-13-4, a
plant pathogen that is the teleomorph (sexual reproductive stage) of
Fusarium solani.

As the tropical forest and arid soil representatives, the 
PD and JC samples were compared with public metage- 
nomes. Interestingly, the principal component analysis 
showed that PD and JC were more similar to each other 
than to other metagenomes (Fig. 3), and were considered 
as one group in order to compare with other groups such 
as rhizosphere and desert. 

The taxonomic profile indicates that the PD and JC group clustered near
the desert metagenomes, with 80.2% and 72.4% of the variance being
explained by the first two components for class and genus, respectively
(Fig. 3A). Although the difference was not significant, the PD and JC
samples differ from desert soils in the Alphaproteobacteria proportion,
which is about two times higher in the Brazilian biomes (13.6% and
7.5%) (Fig. 4). In contrast, the desert samples showed a slight increase
in Actinobacteria (33.2% and 32.1%) (Fig. 4), especially in relation to
the Rubrobacter genus (3.3% and 0.6%), and a lower Mycobacterium
representation (0.2% and 2.4%) (Fig. 5). Moreover, the analyses showed
that in PD and JC there is an overrepresentation of the Methylobacter,
Nitrococcus, and Psychromonas genera (Gammaproteobacteria class), which
is statistically relevant (Fig. 5). In deserts, these genera are
practically absent.

In comparison with metagenomes from the rhizosphere, PD and JC showed
high divergence in Actinobacteria abundance (32.1% and 7.9%,
respectively) (Fig. 4), particularly for the Conexibacter (3.6% and
0.6%), Mycobacterium (2.4% and 1.0%), and Streptomyces (2.3% and 0.6%)
genera (Fig. 5). In contrast, the rhizosphere presented an elevated
proportion of Betaproteobacteria (8.2% and 2.7%), Gammaproteobacteria
(4.2% and 1.6%), and Deltaproteobacteria (4% and 2.2%) (Fig. 4). At
genus level, a higher abundance of Chitinophaga was evident in the
rhizosphere (2.1% and 0.1%, respectively) (Fig. 5). Despite the
divergence observed in class abundance, it was not statistically
significant.

Figure 3. Trends in the PD and JC taxonomy for the class level (A) and for SEED subsystems at level 1 (B) examined using Principal Component
Analysis (PCA) through the STAMP software, based on multiple group analysis, applying the ANOVA test, the Games-Howell post hoc test for the
confidence interval method, and Benjamini-Hochberg FDR for correction.

Comparative functional profiles 

The functional profile obtained using MG-RAST and STAMP did not show
marked differences in proportion for subsystems at level 1 (considering
differences of at least 1% and a twofold ratio between the proportions)
(Fig. 6). However, some level 1 subsystems show a higher representation
in PD or JC with P < 0.05 (Fig. 6). In addition, significant differences
in proportion were observed at levels 2 and 3. At level 1, the
carbohydrates subsystem and the clustering-based subsystems (which group
hypothetical protein families based on conserved colocalization across
multiple genomes) were the most abundant in both the JC and PD samples.
At level 3, the subsystems serine-glyoxylate cycle and YgfZ showed the
highest representation. Subsystems such as amino acids and derivatives,
DNA metabolism, stress response, and nitrogen metabolism are highlighted
in the JC sample, mainly in relation to genes involved in the
degradation of amino acids, DNA repair and replication, ammonia
assimilation, and nitrate and nitrite ammonification. The PD sample
shows the highest representation of the subsystems virulence, disease
and defense, metabolism of aromatic compounds, and phosphorus
metabolism, with significant representation of functions related to
resistance to antibiotics and toxic compounds, benzoate degradation, and
phosphorus uptake (Fig. 6). The analysis of the taxonomic distribution
among the reads annotated in each subsystem showed that in PD the genera
Rhodopseudomonas, Candidatus Solibacter, Candidatus Koribacter,
Bradyrhizobium, Burkholderia, Nitrobacter, and Frankia make the highest
contribution to the functional differences observed, while in JC the
main genera found were Mycobacterium, Nocardioides, Rubrobacter, and
Burkholderia (Figure S3). The functional profile obtained using MEGAN
was similar to the MG-RAST data (data not shown).

Figure 6. Comparative functional profile of PD (in black) and JC (in gray) samples identified by MG-RAST and statistically analyzed by STAMP. (A)
Subsystems at level 1 (*P < 0.05). (B) Subsystems at level 2 (P < 0.05, difference between the proportions >1% and a twofold ratio between the
proportions). (C) Abundance of subsystems at level 3.

The biogeochemical cycle analyses evaluated nitrogen and methane
metabolism (Fig. 7), showing that the metagenomes have similar profiles
in relation to the enzymes involved in these metabolic pathways, with
some exceptions. Concerning nitrogen metabolism (Fig. 7A), singular
patterns in the proportion of hits were observed in JC: a higher
representation of enzymes associated with ammonia production and its
conversion into amino acids was found. In the methane metabolism pathway
(Fig. 7B), PD showed a higher occurrence of carbon monoxide
dehydrogenase (ferredoxin) (EC 1.2.99.2), an enzyme involved in the
oxidation of CO to CO2, while JC had a higher abundance of catalase
(EC 1.11.1.6) and glycine hydroxymethyltransferase (EC 2.1.2.1), which
acts in the conversion of methylenetetrahydrofolate to serine.

Figure 7. Nitrogen (A) and methane (B) metabolic pathways generated by the KEGG mapper from MG-RAST, with the number of hits obtained for each
EC number in PD (blue) and JC (red) (adapted from Kanehisa, www.genome.jp/kegg).

The screening of key functional enzymes (Fig. 8) showed a dominance of
Proteobacteria and Actinobacteria hits. Although Acidobacteria was one
of the dominant phyla in the PD sample, few hits assigned to this phylum
were identified among the key genes analyzed. The highest number of hits
was observed for virulence and pathogenicity, CO oxidation, and acidity
resistance. Differences between the samples were identified in the
carbon fixation, CO oxidation, and acidity resistance categories, which
were predominant in the PD sample, while virulence, pathogenicity, and
nitrate respiration were predominant in the JC sample.



When the PD and JC reads were compared with public metagenomes in terms
of functional profiles, the principal component analysis showed that PD
and JC were more similar to the rhizosphere metagenomes (Fig. 3B), in
contrast to the taxonomic profile, which placed the PD and JC
metagenomes close to the desert microbiota (Fig. 3A).

As a group, PD and JC did not present categories that differed
significantly from the rhizosphere samples. However, when a rhizosphere
sample was compared with PD or JC individually, the rhizosphere
metagenomes had more sequences related to DNA metabolism and iron
acquisition and metabolism at level 1, whereas PD presented a higher
number of hits related to carbohydrate and phosphorus metabolism. In the
JC sample, the amino acids and derivatives and the fatty acids, lipids,
and isoprenoids categories are increased (Figure S4).

A significant functional variation was observed at level 1 for cell wall
and capsule, motility and chemotaxis, amino acids and derivatives,
nucleosides and nucleotides, and membrane transport, which were all
predominant in the PD and JC group when compared to desert metagenomes
(Figure S5).

Discussion 

The soil samples analyzed in this work are classified as sandy soils
with very low nutrient content. The data obtained from the JC sample are
similar to those observed in the majority of soils from the Caatinga
biome (recently reviewed by Giongo et al. 2011 and Menezes et al. 2012).
In contrast, the PD sample differs from other Atlantic Forest soils,
which show a higher organic matter content and clay proportion, with
higher water holding capacity (Faoro et al. 2010; Sousa Neto et al.
2011; Vieira et al. 2011). These differences may be associated with the
distinct taxonomic and functional profiles discussed below, as soil
characteristics such as nutrient availability, moisture, and texture are
determinants of microbial diversity and the consequent biogeochemical
processes (Arias et al. 2005; Chau et al. 2011; Saggar et al. 2013).

The microbial diversity found in PD soil shows a
distinct profile compared to the Atlantic Forest of Rio de
Janeiro and Parana, and also differs from soils of the Amazon
and Cerrado, where the Acidobacteria phylum was dominant,
accounting for between 29 and 63% of the 16S rRNA
sequences obtained (Jesus et al. 2009; Bruce et al. 2010;
Faoro et al. 2010; Araujo et al. 2012). Moreover, a high
occurrence of Acidobacteria was observed in soils from
other subtropical or tropical moist forests (Kim et al.
2012; Nie et al. 2012). It has also been one of the main
groups found in arid or semiarid soils, ranging from 4% to
18% of the 16S sequences (Chanal et al. 2006; Bachar
et al. 2010; Aguirre-Garrido et al. 2012), contrasting with
the JC sample, where only 2.5% of the sequences
corresponded to the Acidobacteria phylum.

The genome analysis of three Acidobacteria species 
indicated the presence of cellulose synthesis genes and 
excreted proteins suggesting potential traits for desicca- 
tion resistance (Ward et al. 2009). However, the Acido- 
bacteria physiology is still relatively unknown and data 
about temperature and ultraviolet resistance are scarce. 
Furthermore, the discrepancy between our data and those
obtained in soils of other Brazilian biomes may be related to
the sand content of the PD and JC samples, since the abundance
of Acidobacteria was found to be higher in the clay than
in the sand or silt fractions (Liles et al. 2010; Russo et al.
2012). Another aspect that is important to consider is the
different methodology used for taxonomic analysis. In
previous works on Atlantic Forest, Amazon, and Cerrado
soils, the analysis was based on 16S sequences (Jesus et al.
2009; Bruce et al. 2010; Faoro et al. 2010; Araujo et al.
2012). In this work, our analysis was based on the LCA
(lowest common ancestor) approach of MG-RAST and MEGAN
using the total DNA sequences. The 16S rRNA analysis has
been efficient for taxonomic identification; however,
expanding the number of markers to include other highly
conserved genes has improved the phylogenetic resolution.
Methods based on LCA or other parsimonious evolutionary
principles are useful to reduce false positives generated by
tools based on homology, increasing the analysis robustness
and permitting more precise taxa abundance estimation
(Clemente et al. 2010; Guo et al. 2013; Segata et al. 2013).
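
As a rough illustration of the LCA (lowest common ancestor) idea
referred to above, the sketch below assigns a read to the deepest
taxon shared by all of the taxa it matches. The toy taxonomy, the
function names, and the example hits are assumptions made for
illustration only; this is not the MG-RAST or MEGAN implementation,
and real tools apply additional filtering before the LCA step.

```python
from typing import Dict, List, Optional

# Hypothetical, heavily simplified taxonomy (child -> parent).
PARENT: Dict[str, str] = {
    "Bradyrhizobium": "Rhizobiales",
    "Rhodopseudomonas": "Rhizobiales",
    "Rhizobiales": "Alphaproteobacteria",
    "Alphaproteobacteria": "Proteobacteria",
    "Burkholderia": "Betaproteobacteria",
    "Betaproteobacteria": "Proteobacteria",
    "Proteobacteria": "Bacteria",
    "Rubrobacter": "Actinobacteria",
    "Actinobacteria": "Bacteria",
    "Bacteria": "root",
}


def lineage(taxon: str) -> List[str]:
    """Return the path from the top of the tree down to `taxon`."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def lowest_common_ancestor(taxa: List[str]) -> Optional[str]:
    """Deepest taxon shared by all lineages, or None if nothing is shared."""
    if not taxa:
        raise ValueError("at least one taxon is required")
    lca = None
    # Walk the lineages rank by rank until they diverge.
    for ranks in zip(*(lineage(t) for t in taxa)):
        if all(r == ranks[0] for r in ranks):
            lca = ranks[0]
        else:
            break
    return lca


# A read hitting two genera of the same order is placed at that order;
# hits spread over two phyla push the assignment up to the domain.
same_order = lowest_common_ancestor(["Bradyrhizobium", "Rhodopseudomonas"])
two_phyla = lowest_common_ancestor(["Bradyrhizobium", "Rubrobacter"])
```

The more ambiguous the hits of a read, the higher in the tree it is
placed, which is how this kind of assignment trades resolution for
fewer false-positive assignments.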

Our data also showed that Actinobacteria was one of the
dominant phyla in both samples. A high occurrence of
Actinobacteria in semiarid soils (20-50%) has been
previously reported (Chanal et al. 2006; Bachar et al.
2010; Koberl et al. 2011; Aguirre-Garrido et al. 2012), as
observed in the JC sample (36.4%). However, its occurrence
in forests is generally low (<15%) (He et al. 2006; Lin
et al. 2010; Nie et al. 2012), especially in Atlantic Forest
and Amazon soils (<5%) (Jesus et al. 2009; Bruce et al.
2010; Faoro et al. 2010), differing from the PD sample, in
which Actinobacteria accounted for 27.8% of the hits. Some
authors have proposed that in vegetated soil, in the
rhizosphere zone, or under plant canopy, the occurrence of
Proteobacteria is high, while barren soils are characterized
by an abundance of Actinobacteria or Acidobacteria (Bernard
et al. 2007; Thomson et al. 2010; Bachar et al. 2012). In sandy
soil, Actinobacteria is one of the dominant classes found
(Russo et al. 2012). Soil water content is a determinant
factor for Actinobacteria abundance, which increases in
arid soil due to the resistance of several species to
drought stress (Connon et al. 2007; Brockett et al. 2012).

Comparative metagenomics analyses have indicated 
that the substrate (i.e., soil or water) plays a fundamental 
role in determining the taxonomic and functional profiles 
of microbial communities. Therefore, soil samples tend to 
be more similar to each other in relation to taxonomy 
and the presence of environment-specific genes than sam- 
ples from other environments (Tringe et al. 2005; Jeffries 
et al. 2011). In PD, sand and nutrient contents and
vegetation seem to be the most important factors for
microbial diversity, while in JC, in addition to these same
factors, stress conditions (caused by temperature, UV, and
drought) also affect the microbial diversity. Our samples
differ from other soils due to their high sand content and
differ from desert biomes due to the high occurrence of
higher plants. These characteristics may explain the data
obtained in our PCA, which showed that the taxonomic
profiles of PD and JC were similar to those of deserts,
while the functional profiles were similar to those of
rhizosphere biomes.

The genera most represented in the PD and JC samples
are mainly members of the Proteobacteria,
Actinobacteria, and Acidobacteria phyla. Species of
genera such as Rhodopseudomonas, Bradyrhizobium, Candidatus
Solibacter, Candidatus Koribacter, Nocardioides,
Rubrobacter, and Geodermatophilus are described as important
in N and C fixation and organic matter degradation, and some
of them, such as Rubrobacter xylanophilus and
Geodermatophilus obscurus (mainly found in the JC sample),
are among the species most resistant to gamma and UV
radiation (Jothimani et al. 2003; Iwai et al. 2005;
Starkenburg et al. 2008; Ward et al. 2009; Ivanova et al.
2010; Chikere et al. 2011; Torres et al. 2011; Yuan et al.
2012). Compared to desert biomes, PD and JC showed a higher
occurrence of the genera Nitrococcus, Psychromonas, and
Methylobacter. These genera are described as involved in the
N and C cycles and as stress resistant, including resistance
to desiccation and salt stress (Bowman et al. 1993; Koops and
Pommerening-Roser 2001; Riley et al. 2008; Ward 2008).

Fungi are also an essential component of terrestrial
ecosystems, acting as organic matter decomposers,
pathogens, and plant mutualists (Anderson et al. 2003; Hunt
et al. 2004; Buée et al. 2009). The higher number of fungal
reads found in PD when compared to the JC sample may be
explained by the plant diversity found in this biome, since
the most represented genera identified are saprophytes or
plant pathogens. It has been proposed that fungi are more
important for the degradation of complex C sources such as
cellulose and lignin, while bacteria are more competitive in
degrading simple C sources. Usually, Basidiomycota is the
dominant phylum found in soil samples (Hunt et al. 2004;
Buée et al. 2009). This is in contrast to our data, which
indicated Ascomycota as more frequent, mainly owing to
the genus Aspergillus, which includes saprophytic species
and opportunistic pathogens (Klich 2002; Horn
2003; Amaike and Keller 2011).

Rapid turnover of organic matter is observed in
soils with high temperature and intermediate moisture,
which are the best conditions for aerobic decomposition,
leading to lower organic matter accumulation and a high
mineralization rate (Sayer 2006; Fierer et al. 2007) and
favoring the growth of copiotrophic bacteria such as
Proteobacteria (Bernard et al. 2007; Thomson et al. 2010).
The characteristics of the vegetation are an important factor
affecting organic matter decomposition, since recalcitrant
compounds are more resistant to microbial degradation
(Allison and Vitousek 2004; DeAngelis et al. 2011). As a
representative of the Atlantic Forest, PD has a rich
diversity of plants, predominantly arboreal, although
herbaceous vegetation such as grasses is also found (Freire
1990). The Caatinga biome has a lower diversity of plants,
characterized by deciduous shrubs and xerophilous
species (Santos and Santos 2008; Santos et al. 2010, 2012).
These characteristics may contribute to the taxonomic
and functional profiles found in JC and PD.

The N and P contents in soils are limiting factors, as
microorganisms and plants compete for these nutrients. In
soils presenting a C:N ratio less than 20:1, organic
matter decomposition occurs quickly, while in soils
presenting a C:N ratio greater than 20:1, decomposition is
slow (Peng et al. 2002; Rennenberg et al. 2009; Richardson
and Simpson 2011). When N is limiting, the growth of
N2-fixing bacteria may be favored, as observed in the PD
sample. Another limiting factor observed in the PD soil is
the low phosphorus content, which may be related to the
highest occurrence of hits annotated in phosphorus
metabolism in PD compared to the JC metagenome.
These differences may be attributed to the genera
Rhodopseudomonas, Candidatus Solibacter, Candidatus Koribacter,
Bradyrhizobium, Burkholderia, Nitrobacter, and Frankia,
which presented the highest representation among the
hits related to N, C, and P metabolism compared to JC
(Figure S3).
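
The C:N rule of thumb cited above can be written as a small helper.
This is a minimal sketch: the function names, the handling of a ratio
of exactly 20:1 (which the text does not cover), and the input values
are assumptions for illustration and are not measurements from the PD
or JC soils.

```python
def cn_ratio(carbon: float, nitrogen: float) -> float:
    """C:N ratio from carbon and nitrogen contents given in the same unit."""
    if nitrogen <= 0:
        raise ValueError("nitrogen content must be positive")
    return carbon / nitrogen


def expected_decomposition(carbon: float, nitrogen: float,
                           threshold: float = 20.0) -> str:
    """Classify expected organic matter decomposition by the 20:1 rule."""
    ratio = cn_ratio(carbon, nitrogen)
    if ratio < threshold:
        return "fast"
    if ratio > threshold:
        return "slow"
    return "boundary"  # exactly 20:1 is not covered by the rule as stated


# Hypothetical contents (e.g. g/kg), chosen only to exercise each branch.
low_ratio = expected_decomposition(30.0, 2.0)   # C:N = 15
high_ratio = expected_decomposition(50.0, 2.0)  # C:N = 25
```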

The low nutrient retention in sandy soils suggests that
plants play an important role in maintaining the
biological diversity due to organic matter degradation and/or
root exudates in the rhizosphere, which include compounds that
may be used as carbon sources by microorganisms (Bais
et al. 2006; Jones et al. 2009). This may explain the
highest occurrence of hits associated with aromatic
compound metabolism and bacterial defense against
toxic compounds, found mainly in the PD sample.

In JC soil, the stress caused by high temperatures, UV
exposure, and long drought periods seems to be an
important factor shaping microbial diversity. This may
explain the occurrence of stress-resistant bacteria, such
as many Actinobacteria genera, including Mycobacterium,
Rubrobacter, and Nocardioides. In fact, the highest frequency
of functional categories related to DNA metabolism,
mainly DNA repair and DNA replication, and to oxidative
and osmotic stress was found in the JC metagenome.
Concerning the N cycle, nitrate and nitrite ammonification
and ammonia assimilation are more represented in the JC
metagenome. These processes are possibly related to
organic matter decomposition and mineralization
(Rennenberg et al. 2009).

Genes related to ammonia oxidation (nitrification;
amoA, B, and C subunits) were not identified (data not
shown) in the PD and JC samples, despite the involvement of
Bacteria and Archaea species in this N cycle step
(Leininger et al. 2006; Di et al. 2009). The low occurrence of
genes related to N2O production, such as nirS (only 1 hit was
found in PD and 1 in JC), suggests a low potential
for greenhouse gas production in these soils, as the
abundance of this gene was proposed as an indicator of
greenhouse gas emission (Morales et al. 2010).
Corroborating this hypothesis, a high occurrence of genes
involved in nitrite and ammonia assimilation suggests the
retention of N, thereby avoiding its loss by denitrification,
which is important in an N-poor environment.

Additionally, it is interesting to observe that both
samples have a good representation of bacterial families
described as N2 fixers. However, the highest abundance of
these bacteria was found in the PD sample, especially when
considering the Alphaproteobacteria class, Rhizobiales
order. The occurrence of the genera Bradyrhizobium,
Rhodopseudomonas, and Nitrobacter suggests an important role
of these microorganisms in N and C fixation and in the
biodegradation of aromatic compounds (Jothimani et al. 2003;
Starkenburg et al. 2008; Torres et al. 2011; Yuan et al. 2012).

Moreover, the CO oxidation genes, especially CO
dehydrogenase, are more represented in the PD sample.
CO oxidation is used by microorganisms as a
source of energy and carbon (King and Weber 2007). Soil
carbon stocks are affected by the addition and decomposition
of organic matter. Acidic soils, as observed for the PD and JC
samples, are generally the most active in removing CO from
air (Inman et al. 1971; Bartholomew and Alexander
1979). This may explain the occurrence of CO oxidation and
acidity-related genes, particularly in the PD sample,
which has the lowest organic matter content.
Furthermore, a higher CO concentration prevents the growth
of nitrate-respiring organisms, and a lower oxygen
concentration in the soil samples enables some aerobic
CO-oxidizers to obtain energy in an anaerobic and
nitrate-independent manner (King 2006). An important source
of volatile compounds such as CO and CO2 is the
photodegradation of organic matter (Schade et al. 1999;
Brandt et al. 2009), which is a dominant process in
semiarid ecosystems during exposure to solar radiation
(Austin and Vivanco 2006; Brandt et al. 2007; Day et al. 2007;
Gallo et al. 2009). In agreement, the JC sample showed the
highest occurrence of CO2-fixation hits, mainly related to
the Mycobacterium genus (Figure S3).

Another curious finding is the high occurrence of the
gene abfD (4-hydroxybutyryl-CoA dehydratase), involved
in the 3-hydroxypropionate/4-hydroxybutyrate cycle, a CO2
fixation process first identified in Archaea species (Berg
et al. 2007). It seems to be frequent in Bacteria phyla,
while the classical bacterial RuBisCO genes are poorly
represented (only 3 hits were found). However, these data
should be viewed with caution, since the role of these
bacterial counterparts remains unclear (Ettema and
Andersson 2008; Ivan et al. 2008).

The serine-glyoxylate cycle subsystem, which is another
pathway for C fixation, is well represented in the JC and
PD samples and may be associated with organic matter
degradation. This is an alternative pathway for one-carbon
(C1) assimilation in methylotrophic bacteria such as some
Mycobacterium species. Methanol is very abundant in soil
due to the degradation of pectin and lignin (Kolb 2009). In
addition, YgfZ, a folate-dependent regulatory protein
involved in one-carbon metabolism (Teplyakov et al.
2004), is also well represented in both JC and PD. Another
indication of this process is the occurrence of genes
related to the Entner-Doudoroff pathway, as this alternative
path for the catabolism of glucose to pyruvate was also
associated with pectin degradation in some bacteria (Paster
and Canale-Parola 1985; Slováková et al. 2002).

In conclusion, although metagenomics is a powerful
tool for the study of microbial biodiversity, much
remains to be understood about the biogeochemical
processes in soils, which requires a multidisciplinary approach.
Although there is a high similarity between the PD and
JC samples at the higher taxonomic and functional
levels, significant differences were found in the lower
hierarchical categories. These were mainly related to
habitat-specific characteristics such as nutrient level,
vegetation, and stressful conditions. In both samples, a rapid
turnover of organic matter with low greenhouse gas
emission was suggested by the functional profiles obtained,
reinforcing the importance of preserving natural areas.
Our data contribute to the understanding of soil
microbial diversity in environments that have seldom been
assessed to date.
cance. The size of the bars is scaled logarithmically to
represent the number of reads assigned to each taxon. 
Figure S3. Taxonomic distribution relative to the SEED
subsystems that showed differences between PD (black
bars) and JC (gray bars). Only the genera with the highest
representation and significant differences are shown.
Figure S4. Comparison of the functional data between
PD versus Holm-Oak rhizosphere (A); PD versus rice
rhizosphere (B); JC versus Holm-Oak rhizosphere (C); and
JC versus rice rhizosphere (D), considering subsystem
level 1. The relative proportion difference in the functional
distribution of the PD and JC samples was considered
significant when q < 0.05.

Figure S5. Comparison computed using two-group
analysis at subsystem level 1 for PD and JC versus deserts.
Only significant differences are shown (Welch's t-test,
the Welch's inverted test for the confidence interval method,
and Benjamini-Hochberg FDR for correction were
applied).
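
As an illustration of the FDR correction named in the caption above,
the sketch below computes Benjamini-Hochberg adjusted p-values
(q-values) in plain Python. The function name and the example
p-values are assumptions for illustration; they are not values from
this study, and this is not the code of the software used for the
figures.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Go from the largest p-value down, enforcing monotonic q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        adjusted = p_values[i] * m / rank
        running_min = min(running_min, adjusted)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values for four functional categories.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
# Categories whose q-value falls below the 0.05 cut-off.
significant = [q < 0.05 for q in example_q]
```

A category would then be reported as significantly different only if
its q-value, rather than its raw p-value, is below the chosen cut-off.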

Table S1. General characteristics of the PD and JC
metagenomes.